Day 1 Session 1:
Static spectral analysis
using Principal Component Analysis

Introduction

Welcome to the workshop Analysing contours in phonetic data: Dynamic analysis hands-on workshop using R! In this workshop, we are going to study two main approachces to dynamic data analysis that are often used in phonetics.

The overarching aim of this workshop is to enable you to attempt different approaches to analysing phonetic data. Specifically, I aim to showcase the following contrasts in data analysis methods:

1. Static vs dynamic data analysis

Static analysis is a common approach to formant analysis in phonetics research. It is often the case, for example, that we extract formant frequencies during a vocalic interval at a single point in time (e.g., segmental onset, temporal midpoint). This is a farily prevalent approach and it is useful when we try to understand general and abstract properties of segments.

While static analysis is a somewhat easy and computationally less expensive approach, this inevitably involves some degrees of data loss/compression. Recent studies have increasingly demonstrated that time-varying properties in speech signals provide rather fine-grained phonetic details that may be important to better understand e.g., synchronic and diachronic variation and change, and cross-linguistic differences/influences in the acquisition of phonetics/phonology. And this is where dynamic analysis methods will come in handy, in which we directly model non-linearity in the data without having to compress/losing much information. My motto in dynamic analysis is modelling what you see.

2. Top-down vs bottom-up approaches to statistical modelling

Another contrast that will be discussed in this workshop is the difference in the way we deal with hypotheses. On the one hand, we may be interested in seeing what the data show us when we do not have any prior knowledge. In other words, our study can be exploratory, and we sometimes prefer data-driven (or bottom-up) approach to identifying key variables of interest, which then feed hypothesis/research question generations.

On the other hand, statistical modelling can start with formulating hypotheses and research questions based on some existing ideas of key variables of interest, which could derive from e.g., previous research. In a way, this process involves a top-down decision making, and hypothesis testing involves asking whether the key variables of interest have a (statistically) significant effect in distinguishing two different populations.

The contrast between bottom-up (data-driven) and top-down (hypothesis-driven) approaches to data analysis influence the nature of research questions that can be asked using each method.

Summary

In the next two days, we will be mainly discussing four statistical methods corresponding to the contrasts above.

  1. Principal Component Analysis (PCA) : static and bottom-up approach

  2. Functional Principal Component Analysis (FPCA): dynamic and bottom-up approach

  3. Linear Mixed-Effect Models (LMEs): static and top-down approach

  4. Generalised Additive Mixed-Effect Models (GAMMs) : dynamic and top-down approach

Note that I do not wish to advocate one method over another: the main objective of this workshop is for us to experience that the same sets of data can be analysed differently, that data visualisation can look different, and that we can ask different research questions from the same sets of data.

In the next section, I will introduce a problem that we will be working on in the course of the workshop. Given the nature of my expertise, I only manage to contexualise the workshop on a somewhat specific research topic, but hopefully the workshop offers something exciting for everyone!

Contexualising the workshop

L1 Japanese speakers’ production of L2 English /l/ and /ɹ/

I contexualise this workshop broadly in the second language (L2) speech learning. More specifically, my research addresses the ‘’Japanese /r/-/l/ problem’’ (Flege, Aoyama & Bohn, 2021, p. 84). There is a long history of research on this topic, in which the first well-known, influential empirical studies could date back to 1970’s (e.g., Goto et al., 1971; Miyawaki et al., 1975). Up until today, lots of researchers have worked on this topic to explain the substantial difficulty that L1 Japanese speakers have in perceiving and producing L2 English liquids (e.g., Sheldon & Strange, 1982; Flege et al., 1995; Bradlow et al., 1997; 1999; Iverson et al., 2003; Aoyama et al., 2004; Saito & Munro, 2014; Saito & van Poeteren, 2018; Shinohara & Iverson, 2018; Aoyama et al., 2019; Aoyama et al., 2023).

Overarching question

There have been quite a few studies investigating L1 Japanese speakers’ production of L2 English liquids. To our surprise, however, there are not many studies in which the production data are analysed instrumentally using acoustic and/or articulatory methods. Previous instrumental acoustic studies would include: Flege et al. (1995), Saito and Munro (2014), Saito and van Poeteren (2018), Aoyama et al. (2019), Aoyama et al. (2023). This list is not exhaustive but does cover pretty much the main ones out there.

These studies commonly utilise static analysis in characterising the acoustic profiles of L1 Japanese speakers’ production of L2 English liquids. Thanks to these studies, we now understand the general acoustic properties involved in the production. These studies, however, have not managed to explain how and why L1 Japanese speakers end up producing non-target-like L2 English liquids.

So, in this workshop, we will explore ways to better explain the mechanism as to why L1 Japanese speakers have substantial difficulty in producing L2 English liquids in a target-like manner based on acoustic data. To rephrase this into a question:

What is the underlying mechanism of the L1 Japanese speakers’ difficulty in producing L2 English liquids in a target-like manner?

Workshop structure

In this two-day workshop, I hope to go through the whole procedure of data analysis from hypothesis generation to hypothesis testing. On Day 1, you will be faced with a new, unfamiliar data set, so we will explore the data structure and identify salient acoustic dimensions that are important in addressing the question. PCA and FPCA are strong tools at this stage for static and dynamic analyses. At the end of the day, ideally, we will be able to generate a hypothesis regarding between-group differences with specific acoustic parameters. On Day 2, we will model and test the hypothesis generated on Day 1 using LMEs and GAMMs.

Workshop materials

In this workshop, we will be using the materials from one of my previous publications. Data and codes for the analysis are publicly available on the Open Science Framework (OSF) repository, and we will retrieve the data sets from there.

If interested, the article can be accessed here (open access):

Nagamine, T. (2024). Formant dynamics in second language speech: Japanese speakers’ production of English liquids. The Journal of the Acoustical Society of America, 155(1), 479–495. https://doi.org/10.1121/10.0024351

Here is the URL for the online supplementary materials deposited on the following OSF repository:

https://osf.io/2phx5/

Statistics basics

Now, let’s start the fun part: data analysis! As mentioned earlier, we will build things up from the static analysis before going into the dynamic analysis.

I assume that people have varying degrees of experience in statistics, so let’s cover some basics first.

Mean and median: the distribution “centre”

The most common parameters in describing a given set of data are mean and median. Assuming the normal distribution, the most common type of data distribution in statistics, both mean and median represent something like the midpoint. But they differ in a small detail:

  • Mean: the sum of all values divided by the number of values. Sensitive to extreme values.

  • Median: the middle value when lining up all the data points from the smallest to the greatest. More robust to extreme values than the mean.

The figures below illustrate mean and median values in (1) normal and (2) right skewed distributions. You can see that mean and median values match on the normal distribution (left plot), whereas the mean score is ‘attracted’ towards the extreme values towards the left (right plot).

Standard deviation: the distribution “spread”

Along with the mean and median values, we often encounter standard deviation (SD). SD captures the degree of ‘’spread’’ of a given distribution, or the distance from the mean in the distribution.

SD tells us the degree of data coverage. For example, one of the distributions shown below has the mean of 50 and the SD of 30 (in the skyblue colour). This indicates that 68% of the data points are covered between -1SD and +1SD: that is, between 50 - 30 = 20 and 50 + 30 = 80.

Similarly, 95% of the data points are included within ±2SDs. In the skyblue distribution below, 95% of the data are covered between 50 - 30 \(\times\) 2 = -10 and 50 + 30 \(\times\) 2 = 110. Usually, the larger SD is, the more spread the distribution looks.

Linear vs non-linear data

Let’s take a quick look into what it means by ‘linear’ and ‘non-linear’ data. Both are the terms used to explain the relationship between two variables. A linear relationship can be expressed as a straight line, whereas a non-linear relationship requires a curve (or a contour).

The figure below exemplifies a linear and a non-linear relationship between X and Y. Generally, the plot with dots like this is called ‘scatter plots’. Here, the black dots show a linear relationship, where a certain amount of increase in X corresponds to another certain amount of increase in Y, and this X-Y correspondence is consistent throughout the range of X/Y values. This linear relationship can be expressed using a straight (black) line.

The yellow dots, on the other hand, show a non-linear relationship between X and Y variables. Although an increase in X corresponds to another increase in Y like in the linear data, the relationship is not consistent, especially towards the end of the x-axis. This relationship cannot be captured with a single line (compare the solid and dotted lines in yellow).

Data visualisation: scatter, box and violin plots

Finally, let’s take a quick look at data visualisation methods. I consider that data visualisation is the pivot in data analysis, and a great data analysis is facilitated by effecive data visualiastion. It is also a means to communicate the research findings with the audience.

Effective data visualisation often combines multiple types of plots. In my usual practice, I often combine three types of plots: scatter, box and violin plots.

  1. Scatter plot: In the figure below, the dots in skyblue colour represents individual data points. The mean value of the data is expressed as ▲.

  2. Box plot: Superimposed on them is the box plot. The box plot consists of a rectangle box, a solid horizontal line in the middle of the box, and ‘whiskers’ at the top and bottom of the box. Each of them represents key parameters of a distribution:

  • The solid horizontal line represents median.

  • Each end of the box represents first and third quartiles. Broadly speaking, you could understand that the box covers the middle 50% of the data, which is divided into half by the solid black line. The lower end of the box is the first quartile (Q1) and the upper end is the third quartile (Q3). Right in the middle is the median, which is the second quartile (Q2).

  • The length of the whiskers below and above the box is determined such that:

    • Lower end: Q1 - 1.5 \(\times\) IQR

    • Upper end: Q3 + 1.5 \(\times\) IQR

  1. Violin plot: Finally, the violin plot shows the density of the data distribution.

Data wrangling and visualisation

OK, so much for the basics, and let’s turn to bottom-up, exploratory approaches to data analysis using principal component analysis (PCA).

These tools can be useful when (1) reducing data dimensions into a manageable number of variables and (2) identifying the degrees to which each parameter contributes to the data structure.

We will run up to PCA in a moment, but let’s start with spending some time understanding data through data wrangling.

Preliminaries

Installing/loading packages

Let’s first install and load R packages that we are using in the static analysis section. The installation commands have been commented out, but please uncomment them and install the packages if you do not have them on your machine yet. We won’t need any specialised package for PCA as it is included as a base function.

# installing
# install.packages("tidyverse")
# install.packages("lme4")
# install.packages("lme4Test")
# install.packages("emmeans")

# importing
library(tidyverse)
library(lme4)
library(lmerTest)
library(emmeans)

Importing data set

Let’s import the data set. We are using the data set openly available on the Open Science Framework (OSF) repository.

# import the csv file "initial.liquid.static.csv" from the "data" directory and save it as "df_mid"
df_mid <- readr::read_csv("data/initial.liquid.static.csv")

Checking data

It’s always a good idea to spend some time inspecting the data set. I usually start with checking the column names using colnames().

# Let's check what columns the data frame contains
colnames(df_mid)
##  [1] "...1"           "file"           "speaker"        "language"      
##  [5] "duration"       "segment"        "word"           "f1"            
##  [9] "f2"             "f3"             "previous_sound" "next_sound"    
## [13] "percent"        "IsApproximant"  "IsAcoustic"     "gender"        
## [17] "omit"           "position"

According to the code above, the data frame contains the following columns:

  • file: The name of source audio files. A legacy information from the acoustic analysis.

  • speaker: Randomly generated speaker IDs consisting of a combination of letters and numbers.

  • language: Speaker’s first language: either English or Japanese

  • duration: Duration of the liquid phoneme.

  • segment: Types of the segment, either L (/l/) or R (/ɹ/)

  • word: Words that the speakers produced.

  • f1, f2 and f3: Formant frequencies in Hz.

  • previous_sound: The sound occurring before the word-initial liquid consonant. Empty as the words were uttered in isolation.

  • next_sound: The sound occurring after the word-initial liquid consonant. In this case, this shows vowel context.

  • percent: Proportional time during the liquid consonant. The data set only contains midpoint measurement, so this should be all 50(%).

  • IsApproximant: Auditory and visual judgement (based on the spectrogram) as to whether the liquid consonant can be classified as an approximant or not (i.e., an alveolar stop or flap).

  • IsAcoustic: Auditory and visual judgement as to the clarity/visibility of the formant structure on the spectrogram. Broadly corresponding to the recording quality.

  • gender: Speakers’ gender. Either male or female.

  • omit: The column indicating whether acoustic analysis tool omitted the token from the analysis. Irrelevant for the purpose of this workshop.

  • position: The syllabic (or word) position of the liquid consonant. All should be initial.

Data tidy-up: tidyverse basics

There are a number of columns in the data set but not all of them are relevant throughout. Let’s tidy up the data frame so that the subsequent analysis will be more straightforward.

Check data

First, we will inspect a summary of the data according to the columns that we have so far. Although not the most efficient, my approach is to combine dplyr::group_by() and dplyr::summarise() functions to summarise the data.

Do not forget dplyr::ungroup() at the end of each operation, too!

# Let's check the number of "approximant" tokens
df_mid |> 
  dplyr::group_by(IsApproximant) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##   IsApproximant
##   <chr>        
## 1 yes
# Let's check the number of tokens of good recording quality
df_mid |> 
  dplyr::group_by(IsAcoustic) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##   IsAcoustic
##   <chr>     
## 1 yes
# How about 'omit'?
df_mid |> 
  dplyr::group_by(omit) |> 
  dplyr::summarise() |> 
  dplyr::ungroup()
## # A tibble: 1 × 1
##    omit
##   <dbl>
## 1     0

Omitting irrelavent columns

The codes above show that the data set only contains liquid tokens that are classified as an approximant, deemed as being of good recording quality and thus included in the acoustic analysis. This means that we can safely remove these columns. Let’s do this via dplyr::select() function. After that, we check the column names again using colnames(), showing that the three columns have been safely removed from the data frame.

# Remove columns 
df_mid <- df_mid |> 
  dplyr::select(-c(IsApproximant, IsAcoustic, omit))

Your turn

Please write a code below to check the updated column names:

# Check column names again
# ...

Renaming variables

Later on in this workshop, we will be modelling the formant frequencies as a function of vowel context. In terms of the current data set, this corresponds to the next_sound column, but let’s rename it into vowel to make it more intuitive.

Also, we will also convert the ARPABET notations into the IPA symbols.

df_mid <- df_mid |> 
  dplyr::mutate(
    vowel = case_when(
      next_sound == "AE1" ~ "/æ/",
      next_sound == "IY1" ~ "/i/",
      next_sound == "UW1" ~ "/u/",
    )
  )

Checking the number of participants, tokens…

Let’s also obtain some descriptive statistics here.

# number of participants
df_mid |> 
  dplyr::group_by(language) |> 
  dplyr::summarise(n = n_distinct(speaker)) |> 
  dplyr::ungroup()
## # A tibble: 2 × 2
##   language     n
##   <chr>    <int>
## 1 English     14
## 2 Japanese    31
# number of tokens per segment
df_mid |> 
  dplyr::group_by(segment) |> 
  dplyr::summarise(n = n()) |> 
  dplyr::ungroup()
## # A tibble: 2 × 2
##   segment     n
##   <chr>   <int>
## 1 L        1114
## 2 R        1192

Your turn

Is there anything else that you would like to know about the data set?

You can start with checking the column names to see what variables are available in the data set. Then, use dplyr::group_by(), dplyr::summarise() and dplyr::ungroup() functions to inspect the data.

# Check data further
# ...

Data visualisation: ggplot basics

Hopefully, you start to become relatively familiar with the data set through the checking and tidying-up process above. Now, in order to understand the overall trend of the data, let’s try some data visualisation. I heavily rely on the ggplot2 package in the tidyverse suite.

Preparation

Before visualising the data, we need to make sure that the formant frequencies are comparable across speakers. We know that we cannot directly compare raw formant frequencies because they are heavily influenced by the vocal tract characteristics depending on the speaker’s gender, sex, height etc.

To address this, let’s normalise the formant frequencies across speakers. There are various normalisation methods, but today we are just using z-score normalisation, in which the mean value for each speaker is expressed as zero and the standard deviation is scaled to one. This can be achieved via scale() function included in base R.

Let’s create f1z, f2z and f3z columns using the dplyr::mutate() function. When doing this, make sure that you do the scaling for each speaker by specifying dplyr::group_by(speaker).

df_mid <- df_mid |> 
  dplyr::group_by(speaker) |> # tell R to do the following iteration per speaker
  dplyr::mutate(
    f1z = as.numeric(scale(f1)), # scale f1 into z-score
    f2z = as.numeric(scale(f2)), # scale f2 into z-score
    f3z = as.numeric(scale(f3)) # scale f3 into z-score
  ) |> 
  dplyr::ungroup() # don't forget ungrouping

Let’s check the mean and SD for both raw and normalised formant values: just see F1 for now. Note that the mean z-scores do not seem to look zero, but this is because computers are not very good at dealing with very small numbers (e.g., decimals) and some fluctuations occur in computing the values.

# check mean and sd of raw/scaled F1 values for each speaker
df_mid |> 
  dplyr::group_by(speaker) |>
  dplyr::summarise(
    f1_mean = mean(f1),
    f1_sd = sd(f1),
    f1z_mean = mean(f1z),
    f1z_sd = sd(f1z)
  ) |> 
  dplyr::ungroup() 
## # A tibble: 45 × 5
##    speaker f1_mean f1_sd  f1z_mean f1z_sd
##    <chr>     <dbl> <dbl>     <dbl>  <dbl>
##  1 2d57ke     372.  74.7  3.13e-16      1
##  2 2drb3c     430.  50.6  3.04e-16      1
##  3 2zy9tf     358.  73.8 -3.20e-16      1
##  4 3bcpyh     374.  41.0  1.68e-16      1
##  5 3hsubn     390.  82.6 -2.39e-16      1
##  6 3pzrts     335.  27.7  6.52e-16      1
##  7 4ps8zx     392.  68.7 -3.06e-16      1
##  8 54i2ks     348.  65.9 -3.35e-17      1
##  9 5jzj2h     366.  76.8 -1.22e-16      1
## 10 5upwe3     343.  42.1  1.05e-16      1
## # ℹ 35 more rows

Your turn

Write a code chunk to inspect the mean and sd of raw/scaled F2 and F3 values for each speaker and for speaker group. Think how you group the data in dplyr::group_by().

# check mean and sd of raw/scaled F2 values for each speaker group
# ...

# check mean and sd of raw/scaled F3 values for each speaker group
# ...

Visualisation: Spectral characteristics

We can now compare F1 values across participant groups for each segment (/l/ and /ɹ/). For clearer visualisation, the following codes define the range of y-axis between -4 and 4, and some extreme values have been omitted from the plot.

In the code below, I visualise the f1z values using the scatter, box and violin plots. The line with ggplot() defines x- and y-axis, and the following three lines with geom_jitter(), geom_violin() and geom_boxplot() superimpose the plot on top of each other and define aesthetic. facet_grid() divides the plot into multiple levels of a variable to understand differences across conditions.

# F1
df_mid |> 
  # define x- and y-axis
  ggplot(aes(x = language, y = f1z)) +
  # scatter plot
  geom_jitter(aes(colour = language), width = 0.3, alpha = 0.4) +
  # superimposing violin plot on scatter plot
  geom_violin(alpha = 0.4) +
  # superimposing boxplot on scatter/violin plots
  geom_boxplot(width = 0.4, alpha = 0.4) +
  # drawing a horizontal line at y = 0 to facilitate interpretation
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.5) +
  # define colour scheme
  scale_colour_manual(values = cbPalette) + 
  # define limits on the y-axis
  ylim(c(-4, 4)) +
  # split the plot according to the ``segment`` and ``vowel`` variables
  facet_grid(segment ~ vowel) +
  # define axis and main labels
  labs(x = "L1", y = "F1 (z-normalised)", title = "F1 frequency") +
  # adjust the label on the y-axis to make it look better
  theme(strip.text.y = element_text(angle = 0))

Your turn

Could you try visualising f2z and f3z?

# F2
# df_mid |> ...
 
# F3
# df_mid |> ...

Overall, the data visualisation so far seems to suggest:

  1. It seems that vowel contexts may influence the F1 values, such that the /æ/ context yields the overall highest F1 values. The two groups of speakers do not seem to exhibit clear differences along the F1 dimension.

  2. There also seems to be the effect of vowel context, in which F2 is overall highest in the /i/ context. L1 Japanese speakers seem to produce English liquids with higher F2 in the /i/ context and lower F2 in the /u/ context than L1 English speakers.

  3. Whereas F3 seems comparable between the two speaker groups for /l/, L1 Japanese speakers seem to produce English /ɹ/ with a little higher F3 than L1 English speakers, although the magnitude of such between-group differences seem to vary depending on the vowel context.

Exploring visualisation further

Your turn

Data visualisation is a crucial aspect in data analysis and in open data science: Slight changes in data visualiastion would convey different messages to whoever see the plots. It is always beneficial (and fun!) to play around a little with ggplot. So, let’s stop for a moment and have a bit of fun in data visualisation.

Explore different ways to visualise the F1, F2 and F3 values. For example, you could:

  • throw different variables in the facet_wrap() command

  • plot raw F1, F2 and F3 values (i.e., simply typing f1 instead of f1z) to see how the two plots compare

  • adjust the alpha values to see what happens

  • completely change the x/y axis labels

  • change the order of geom_jitter(), geom_violin and geom_boxplot(), and/or comment out some of them to disable

  • adjust the y-axis range by adjusting the numerics in ylim()

  • etc etc…

# try your visualisation
# ...

Correlation

Let’s also take a slightly different approach. The objective of PCA is to draw straight lines along the dimension that shows the greatest variance. To inspect the variance in the data, we’ll plot the relationships between the three spectral dimensions.

# F1 and F2
df_mid |> 
  ggplot(aes(x = f1z, y = f2z)) +
  geom_point(aes(colour = segment), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(~ language)

# F2 and F3
df_mid |> 
  ggplot(aes(x = f2z, y = f3z)) +
  geom_point(aes(colour = segment), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(~ language)

# F1 and F3
df_mid |> 
  ggplot(aes(x = f1z, y = f3z)) +
  geom_point(aes(colour = segment), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_wrap(~ language)

As expected, English /l/ and /ɹ/ can be reliably distinguished along the F3 dimension, such that English /ɹ/ shows lower F3 than English /l/.

Temporal characteristics

OK, then how about duration? Let’s visualise them first, and then check how it correlates with each of the spectral parameters. We’ll use the original df_mid data frame again but convert the duration into millisecond.

Violin plot

# convert duration (s) into (ms)
df_mid <- df_mid |> 
  dplyr::group_by(speaker) |> 
  dplyr::mutate(
    duration_ms = duration * 1000,
    duration_ms_z = scale(duration_ms)
  ) |> 
  dplyr::ungroup()

# plot
df_mid |> 
  ggplot(aes(x = language, y = duration_ms_z)) +
  geom_jitter(aes(colour = language), alpha = 0.5) +
  geom_violin(alpha = 0.4) +
  geom_boxplot(width = 0.3, alpha = 0.4) +
  facet_grid(~ segment) +
  scale_colour_manual(values = cbPalette)

Correlation

Let’s also check whether duration correlates with any of the spectral measures.

# with F1 
df_mid |> 
  ggplot(aes(x = f1z, y = duration_ms_z)) +
  geom_point(aes(colour = language), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_grid(segment ~ language) +
  theme(strip.text.y = element_text(angle = 0))

# with F2
df_mid |> 
  ggplot(aes(x = f2z, y = duration_ms_z)) +
  geom_point(aes(colour = language), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_grid(segment ~ language) +
  theme(strip.text.y = element_text(angle = 0))

# with F3
df_mid |> 
  ggplot(aes(x = f3z, y = duration_ms_z)) +
  geom_point(aes(colour = language), alpha = 0.4) +
  scale_colour_manual(values = cbPalette) +
  facet_grid(segment ~ language) +
  theme(strip.text.y = element_text(angle = 0))

Mmm, we do not seem to find particularly interesting correlations here. But this is fine for now, as we’re just exploring this. The key takeaway here is that we always need multiple plots in order to obtain a holistic understanding of the data structure, which is somewhat cumbersome. Here, PCA comes in handy, as it is a dimension reduction technique and boils down the variance in the data into a handful of principal components.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a dimensionality reduction technique that transforms data into a new coordinate system. Given data points that show relationships between two or more variables (e.g., x and y), PCA identifies new axes (principal components, PCs) that capture the maximum variance in the data:

  • PC1 (first principal component) captures the direction of the greatest variance in the data.

  • PC2 (second principal component) is perpendicular (orthogonal) to PC1 and captures the second largest variance.

If there are more dimensions, additional principal components (PC3, PC4, etc.) are defined similarly, always orthogonal to the previous ones.

After identifying these new principal components, the data is re-expressed in terms of these new axes. This effectively rotates the coordinate system so that PC1 aligns with the new x-axis and PC2 with the new y-axis. However, the relative positions of the data points remain unchanged.

Let’s apply PCA on the formant data. We’ll classify the tokens based on the z-scored F1, F2 and F3. We use the prcomp() function to perform PCA.

# rearrange column order for easier data subsetting
df_mid <- df_mid |> 
  dplyr::relocate(f1z, f2z, f3z, duration_ms_z)

# check colnames
colnames(df_mid)
##  [1] "f1z"            "f2z"            "f3z"            "duration_ms_z" 
##  [5] "...1"           "file"           "speaker"        "language"      
##  [9] "duration"       "segment"        "word"           "f1"            
## [13] "f2"             "f3"             "previous_sound" "next_sound"    
## [17] "percent"        "gender"         "position"       "vowel"         
## [21] "duration_ms"
# PCA on F1z, F2z, F3z and duration
pca_mid <- prcomp(dplyr::select(df_mid, 1:4))

# check summary
summary(pca_mid)
## Importance of components:
##                           PC1    PC2    PC3    PC4
## Standard deviation     1.1420 1.0411 0.9836 0.7537
## Proportion of Variance 0.3324 0.2762 0.2466 0.1448
## Cumulative Proportion  0.3324 0.6086 0.8552 1.0000

The summary shows three rows:

  • Standard deviation: This fundamentally shows the variance shown in the data. The greater the value, the more variance is captured by each PC.

  • Proportion of Variance: This expresses the amount of variance explained by each PC in a proportional manner. This can be calculated with the following formulae:

\[ pca\_var\_exp = \frac{(\text{sdev})^2}{\sum pca\_var} \]

  • Cumulative Proportion: This simply shows the sum of the proportion of variance explained by each PC. For instance, PC1 explains 32.76% of variance whereas PC2 26.92%. The cumulative proportion of PC1 and PC2 is thus 59.68% (i.e., 32.76% + 26.92%).

Another important index in PCA is eigenvalues, which represent how much variance is explained by each principal component (PC). The standard deviation of a PC is the square root of its eigenvalue, meaning that eigenvalues are the squared standard deviations of the PCs. The proportion of variance explained by a given PC is calculated by dividing its eigenvalue by the sum of all eigenvalues.

# calculate eigenvalues
eigenvalues <- (pca_mid$sdev)^2

# show eigenvalues
eigenvalues
## [1] 1.3041605 1.0838265 0.9675303 0.5681270
# calculating proportion of variance 
pca_var_exp <- eigenvalues / sum(eigenvalues)  # Proportion of variance

# show the proportion of variance explained by each PC
pca_var_exp
## [1] 0.3323850 0.2762295 0.2465897 0.1447957

PCA summary

Let’s also look into the inside of pca_mid.

# show the PCA results
pca_mid
## Standard deviations (1, .., p=4):
## [1] 1.1419985 1.0410699 0.9836312 0.7537420
## 
## Rotation (n x k) = (4 x 4):
##                      PC1          PC2        PC3        PC4
## f1z           0.02567994  0.554175433  0.7973415  0.2376484
## f2z           0.59519981  0.375814985 -0.4453013  0.5533598
## f3z           0.73795645 -0.003335683  0.1729687 -0.6522967
## duration_ms_z 0.31701420 -0.742725764  0.3688295  0.4602447

The top line shows standard deviation, which we just saw earlier, so let’s skip this for now.

At the bottom, we have loadings matrix. This shows eigenvectors: i.e., the amount and (2) the direction of the contribution that the original data dimensions make to each PC. Let’s disentangle this one by one:

  • PC1: Contributions of f1z, f2z, f3z and duration are all in the same direction (i.e., positive). But the amount of contribution is fairly small for f1z and duration compared to f2z and f3z.

  • PC2: In contrast to PC1, f1z contributes to PC2 to a greater extent than f2z and f3z. duration also seems to make a big contribution here but in the opposite direction (i.e., negative).

  • PC3: The contribution of duration seems quite big here, too.

  • PC4: f2z and f3z both seem to contribute to this dimension, although they are in antagonistic direction.

So, all in all, PCA might suggest:

  • PC1 captures a covariation between f2z and f3z, which may correspond to the spectral difference between /l/ and /ɹ/.

  • PC2 mainly captures a covariation of duration and f1z but in opposite directions.

  • PC3 shows variation in f1z.

  • PC4 again seems to capture f2z and f3z but in opposite directions.

The interpretation may be better facilitated by data visualisation. Let’s try two main approaches to data visualiastion of PCA.

Data visualisation 1: Scree plot

The amount of variance is useful when determining which PC to retain for analysis. As a rule of thumb, Bayeen (2008) recommends that we retain PCs that explains greater than 5% of variance in the data.

A useful visualisation of the proportion of variance is scree plot. There is a default function screeplot() that lets you create a scree plot quite easily:

# scree plot: bar chart
screeplot(pca_mid)

# scree plot: line plot
screeplot(pca_mid, type = "line")

You can also create a scree plot using ggplot() by manually calculating the proportion of variance based on the standard deviation (see above for the formula). This is useful when you need more flexibility – e.g., to show the 5% thereshold suggested by Bayeen (2008).

## obtaining proportion of variance 
pca_var_exp <- pca_mid$sdev^2 / sum(eigenvalues)

# making var_explained as a tibble 
pca_var_exp <- tibble::as_tibble(pca_var_exp)

# adding column name
pca_var_exp <- pca_var_exp |>  
  tibble::as_tibble() |>  
  dplyr::mutate(
    PC = row_number()
  )

# create a plot
scree_plot <- pca_var_exp |> 
  ggplot(aes(x = PC, y = value)) +
  geom_line() +
  geom_text(aes(label = round(value, digits = 3)*100), nudge_x = 0, nudge_y = -0.02) +
  geom_label(aes(label = PC), label.padding = unit(0.40, "lines")) +
  geom_hline(yintercept = 0.05, linetype = 'dotted') +
  scale_x_continuous(breaks = c(1, 2, 3)) +
  labs(x = "Principal Component", y = "Variance Explained", title = "Proportion of Variance explained by each PC")

# showing plot
scree_plot

Since there are only three PCs identified from the data, it doesn’t make much sense to visualise them. But this will be useful when prcomp() identifies more than a few PCs.

creating biplot

The interpretation of loading matrix was more or less straightforward in this case given that only three parameters were involved. It is easy to see, however, that this can get quite complicated when the original data have a greater number of dimensions.

Biplots help us better understand the loadings and PC scores for each observation. For example, the default function biplot() plots all observations along the PC1 and PC2 dimensions with eigenvectors shown in red arrows (i.e., the direction and the amount of weighting: also called loadings).

# default function for biplot
biplot(pca_mid)

Detailed biplots

Similarly to the scree plot, we could also create biplots manually using ggplot(). This will help us better understand what each PC could stand for in the current data set.

# Extract PCA scores (principal component scores for each observation)
pca_scores <- as.data.frame(pca_mid$x)

# Extract PCA loadings (the contribution of each original variable to the PCs)
pca_loadings <- as.data.frame(pca_mid$rotation)

# Combine scores into one data frame for plotting
scores_df <- cbind(pca_scores, df_mid[, c("segment", "language", "vowel")])

# Reshape the loadings for visualization
loadings_df <- pca_loadings |> 
  mutate(variable = rownames(pca_loadings))

# Plotting the PCA biplot with facets by language
ggplot() +
  geom_point(data = scores_df, aes(x = PC1, y = PC2, color = segment, shape = segment), size = 2, alpha = 0.5) + # Plot the PCA scores (observations)
  geom_segment(data = loadings_df, aes(x = 0, y = 0, xend = PC1 * 5, yend = PC2 * 5), arrow = arrow(type = "closed", length = unit(0.1, "inches")), color = "red") + # Plot the loadings (variables)
  geom_text(data = loadings_df, aes(x = PC1 * 5, y = PC2 * 5, label = variable), size = 4, color = "red") + # Add labels for the loadings
  # stat_ellipse(data = scores_df, aes(x = PC1, y = PC2, color = vowel, fill = vowel), geom = "polygon", alpha = 0.2, level = 0.95) + 
  facet_grid(language ~ vowel) +
  labs(title = "PCA biplot by segment and vowel") +
  scale_color_manual(values = cbPalette) + 
  scale_fill_manual(values = cbPalette) + 
  theme(legend.position = "bottom",
        strip.text.y = element_text(angle = 360))

Using PC scores as data

So far, we have identified four PCs, each of which explains more than 5% of variance in the data. At first glance, this doesn’t seem to be a good way of data reduction, as we got four PC dimensions out of four variables (f1z, f2z, f3z, and duration).

However, we could also argue that data reduction has been indeed achieved in the sense that some PCs capture joint contributions of multiple parameters – e.g., PC1 showing covariation of f2z and f3z. In other words, we have identified in a data-driven manner that the two spectral parameters act together, and we have compressed the two parameters into one PC dimension (i.e., PC1).

One of the advantages of using PCA is that it converts the data onto new scales in a data-driven manner. More specifically, PCA associates each data point with new numeric values called PC scores, showing variation of each data point along each PC dimensions. As a common practice, we can use PC scores as input to further data visualisation and statistical analysis.

Extracting PC scores

PC scores are stored as x in the PCA output. Let’s use head() function to only display the first six rows as otherwise the list would be too big.

# displaying PC scores
head(pca_mid$x)
##             PC1         PC2         PC3        PC4
## [1,]  0.1030871 -0.46456046 1.842069714 -0.3630188
## [2,]  0.3992679 -0.34912107 1.606508257 -0.3671703
## [3,] -0.2207061  0.81275522 1.210015271 -1.0763739
## [4,]  0.1631096 -0.02578864 0.004715655 -1.2024286
## [5,]  0.2346244 -0.63551312 1.461971228 -0.9505229
## [6,]  0.3949973 -0.74218923 2.054526744 -0.1311160

Assuming that we haven’t made any changes to the order of rows, we can combine the PC scores with the existing data set. We have actually already done this when visualising the data, so I simply copy and paste the codes here:

# extract PCA scores (principal component scores for each observation)
pca_scores <- as.data.frame(pca_mid$x)

# check the number of rows
## pc score
nrow(pca_scores)
## [1] 2306
## main data set
nrow(df_mid)
## [1] 2306

Both pc_score and df_mid contain the same number of rows, so we can merge them into one data frame.

# combine scores into one data frame for plotting
scores_df <- cbind(pca_scores, df_mid[, c("segment", "language", "vowel", "speaker", "gender", "word")])

Check PC scores

Before visualisation, let’s take a look at descriptive statistics of PC scores. We’ll focus on the first two PCs only here.

scores_df |> 
  dplyr::group_by(language, segment, vowel) |> 
  dplyr::summarise(
    PC1_mean = mean(PC1),
    PC1_sd = sd(PC1),
    PC2_mean = mean(PC2),
    PC2_sd = sd(PC2)
  ) |> 
  dplyr::ungroup()
## `summarise()` has grouped output by 'language', 'segment'. You can override
## using the `.groups` argument.
## # A tibble: 12 × 7
##    language segment vowel PC1_mean PC1_sd PC2_mean PC2_sd
##    <chr>    <chr>   <chr>    <dbl>  <dbl>    <dbl>  <dbl>
##  1 English  L       /i/      0.917  0.785 -0.0570   0.923
##  2 English  L       /u/      0.965  0.762 -0.629    0.989
##  3 English  L       /æ/      0.797  0.593  0.104    1.08 
##  4 English  R       /i/     -0.632  0.768  0.0538   0.940
##  5 English  R       /u/     -0.808  0.685 -0.177    0.978
##  6 English  R       /æ/     -1.06   0.769  0.446    1.33 
##  7 Japanese L       /i/      0.898  0.753 -0.00896  0.927
##  8 Japanese L       /u/      0.278  0.956 -0.617    0.953
##  9 Japanese L       /æ/      0.585  0.890  0.146    0.995
## 10 Japanese R       /i/     -0.163  1.03   0.221    0.905
## 11 Japanese R       /u/     -0.901  0.907 -0.392    0.885
## 12 Japanese R       /æ/     -0.721  1.08   0.218    1.03

We get lots of numbers here – please do feel free to take a moment to digest these. But we could also visualise them to better facilitate the interpretation.

Visualising PC1

Let’s visualise PC scores to see if our assumptions about each PC is correct. We think that PC1 captures covariation of f2z and f3z, and this should correspond to the contrast between English /l/ (lower F2, higher F3) and /ɹ/ (higher F2, lower F3).

## PC1 - by liquid consonant
scores_df |> 
  ggplot(aes(x = segment, y = PC1, colour = segment)) +
  geom_jitter(width = 0.3) +
  geom_violin(alpha = 0.3) +
  geom_boxplot(width = 0.3, colour = "black") +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.3) +
  scale_colour_manual(values = cbPalette) +
  facet_grid(language ~ vowel) +
  theme(
    strip.text.y = element_text(angle = 360)
  )

This is more or less true, good! We could also make a between-group comparison.

## PC1 - by group
scores_df |> 
  ggplot(aes(x = language, y = PC1, colour = language)) +
  geom_jitter(width = 0.3) +
  geom_violin(alpha = 0.3) +
  geom_boxplot(width = 0.3, colour = "black") +
  geom_hline(yintercept = 0, linetype = "dashed", linewidth = 0.3) +
  scale_colour_manual(values = cbPalette) +
  facet_grid(segment ~ vowel) +
  theme(
    strip.text.y = element_text(angle = 360)
  )

We can see some patterns here: e.g., both L1 English and L1 Japanese speakers are similar for /l/, L1 Japanese speakers produce smaller PC1 scores in the /u/ context than L1 English speakers. Some between-group differences can be found for English /ɹ/ as well.

Visualising PC2

Your turn

In a similar manner to PC1, please visualise PC2, PC3 and PC4. Please try different visualisation methods to explore what each PC dimension highlights.

PC2 mainly captured covariation between duration and f1z. This is a little less intuitive, but previous research has indeed shown that L1 English and L1 Japanese speakers may exhibit different F1 transition duration profiles. So, it would be interesting to visualise PC2 to show between-group differences:

## PC2
### scores_df |> ...

Statistical analysis

As I said earlier, as a somewhat common practice, PC scores can serve as input to further statistical analysis. So far, we have found that PC1 captures the contrast between English /l/ and /ɹ/; although this is sort of obvious, the most important message here is that the data tells us about it without any apriori knowledge.

In the data visualisation above, it was somewhat difficult to identify which of these variables: language, vowel and segment, would have a significant effect. At this point, we could construct a linear mixed-effect model to test whether these variables have a statistically significant effect on the PC scores.

I don’t show the full results here because it’s too long, but more to come tomorrow when we come back to talk about LMEs!

# converting variables into factor
scores_df <- scores_df |> 
  dplyr::mutate(
    language = as.factor(language),
    vowel = as.factor(vowel),
    segment = as.factor(segment),
    speaker = as.factor(speaker),
    word = as.factor(word)
  )

# run a full model
m1 <- lme4::lmer(PC1 ~ language + segment + vowel + language:segment + language:vowel + (1|speaker) + (1|word), data = scores_df, REML = FALSE)

# model summary
summary(m1)

# nested model 1 -- testing the language-segment interaction
m2 <- lme4::lmer(PC1 ~ language + segment + vowel + language:vowel + (1|speaker) + (1|word), data = scores_df, REML = FALSE)

# model comparison
anova(m1, m2, test = "Chisq")

# nested model 2 -- testing the language-vowel interaction
m3 <- lme4::lmer(PC1 ~ language + segment + vowel + language:segment + (1|speaker) + (1|word), data = scores_df, REML = FALSE)

# model comparison
anova(m1, m3, test = "Chisq")

# post-hoc analysis
## between-group effects of each language
emmeans::emmeans(m1, pairwise ~ language | segment:vowel, adjust = "tukey")

Saving the model

It will be interesting to fit a linear mixed-effect model on PC scores tomorrow. So let’s save the model scores_df so that we can retrieve it easily when we need it.

save(scores_df, file = "data/pca_scores_df.rda")

Summary

I hope I have shown that PCA is a strong tool to identify salient dimensions in the data in a bottom-up/data-driven manner. It projects the data onto a new dimension based on the identified principal components. Furthermore, we can obtain PC scores for each observation along the PC dimensions, and we can use them as input to further statistical analysis. The flexibility in the choice of statistical modelling is an advantage of PCA.

Summarising what we have found so far:

  • PCA here identifies four principal components (PCs) that explains 32.76% (PC1), 26.92% (PC2), 24.64% (PC3) and 15.68% (PC4) of the variance in the data.

  • The largest proportion of variance is explained by PC1 that captures a covariation of F2 and F3 as suggested by the biplot. The data visualisation clearly shows that English /l/ and /ɹ/ can be distinguished along PC1 (English /l/: higher PC1 values, English /ɹ/: lower PC1 values)

  • We can also use PC scores as data for further statistical analysis. This flexibility is where PCA shines!

Wrap-up question

What acoustic parameters would you like to test using linear mixed-effect models tomorrow?

Session Information

sessionInfo()
## R version 4.3.2 (2023-10-31)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS 15.3.2
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Europe/London
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] emmeans_1.8.9    lmerTest_3.1-3   lme4_1.1-35.1    Matrix_1.6-1.1  
##  [5] ggpubr_0.6.0     lubridate_1.9.4  forcats_1.0.0    stringr_1.5.1   
##  [9] dplyr_1.1.4      purrr_1.0.2      readr_2.1.4      tidyr_1.3.1     
## [13] tibble_3.2.1     ggplot2_3.5.1    tidyverse_2.0.0  rmdformats_1.0.4
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6        xfun_0.50           bslib_0.7.0        
##  [4] rstatix_0.7.2       lattice_0.21-9      numDeriv_2016.8-1.1
##  [7] tzdb_0.4.0          vctrs_0.6.5         tools_4.3.2        
## [10] generics_0.1.3      pbkrtest_0.5.2      parallel_4.3.2     
## [13] highr_0.10          pkgconfig_2.0.3     lifecycle_1.0.4    
## [16] compiler_4.3.2      farver_2.1.2        munsell_0.5.1      
## [19] carData_3.0-5       htmltools_0.5.8.1   sass_0.4.9         
## [22] yaml_2.3.8          crayon_1.5.2        nloptr_2.0.3       
## [25] pillar_1.10.1       car_3.1-2           jquerylib_0.1.4    
## [28] MASS_7.3-60         cachem_1.0.8        boot_1.3-28.1      
## [31] abind_1.4-8         nlme_3.1-163        tidyselect_1.2.1   
## [34] digest_0.6.36       mvtnorm_1.2-5       stringi_1.8.4      
## [37] bookdown_0.42       labeling_0.4.3      splines_4.3.2      
## [40] cowplot_1.1.3       fastmap_1.1.1       grid_4.3.2         
## [43] colorspace_2.1-1    cli_3.6.4           magrittr_2.0.3     
## [46] utf8_1.2.4          broom_1.0.5         withr_3.0.2        
## [49] scales_1.3.0        backports_1.5.0     bit64_4.0.5        
## [52] estimability_1.4.1  timechange_0.3.0    rmarkdown_2.26     
## [55] bit_4.0.5           gridExtra_2.3       ggsignif_0.6.4     
## [58] hms_1.1.3           coda_0.19-4.1       evaluate_0.23      
## [61] knitr_1.45          mgcv_1.9-0          rlang_1.1.5        
## [64] Rcpp_1.0.14         xtable_1.8-4        glue_1.8.0         
## [67] vroom_1.6.5         minqa_1.2.6         rstudioapi_0.15.0  
## [70] jsonlite_1.8.8      R6_2.6.1